{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Extracting Data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "In this chapter, you'll learn about some different ways to extract data: from the web, from documents, and elsewhere. This chapter uses packages such as **pandas-datareader** and **BeautifulSoup** that you may need to install first.\n",
    "\n",
    "### Imports\n",
    "\n",
    "First we need to import the packages we'll be using"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import pandas as pd\n",
    "import requests\n",
    "from bs4 import BeautifulSoup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set max rows displayed for readability\n",
    "pd.set_option(\"display.max_rows\", 6)\n",
    "# Plot settings\n",
    "plt.style.use(\n",
    "    \"https://github.com/aeturrell/coding-for-economists/raw/main/plot_style.txt\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting data from files on the internet\n",
    "\n",
    "As you will have seen in some of the examples in this book, it's easy to read data from the internet once you have the url and file type. Here, for instance, is an example that reads in the 'storms' dataset (we'll only grab the first 10 rows):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "pd.read_csv(\n",
    "    \"https://vincentarelbundock.github.io/Rdatasets/csv/dplyr/storms.csv\", nrows=10\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting data using APIs\n",
    "\n",
    "Using an API (application programming interface) is another way to draw down information from the interweb. Their just a way for one tool, say Python, to speak to another tool, say a server, and usefully exchange information. The classic use case would be to post a request for data that fits a certain query via an API and to get a download of that data back in return. (You should always preferentially use an API over webscraping a site.)\n",
    "\n",
    "Because they are designed to work with any tool, you don't actually need a programming language to interact with an API, it's just a *lot* easier if you do.\n",
    "\n",
    "```{note}\n",
    "An API key is needed in order to access some APIs. Sometimes all you need to do is register with site, in other cases you may have to pay for access.\n",
    "```\n",
    "\n",
    " To see this, let's directly use an API to get some time series data. We will make the call out to the internet using the **requests** package.\n",
    "\n",
    "An API has an 'endpoint', the base url, and then a URL that encodes the question. Let's see an example with the ONS API for which the endpoint is \"https://api.ons.gov.uk/\". The rest of the API has the form 'key/value', for example we'll ask for timeseries data 'timeseries' followed by 'JP9Z' for the vacancies in the UK services sector. We then ask for 'dataset' followed by 'UNEM' to specify which overarching dataset the series we want is in. The last part asks for the data with 'data'. Often you won't need to know all of these details, but it's useful to see a detailed example.\n",
    "\n",
    "The data that are returned by APIs are typically in JSON format, which looks a lot like a nested Python dictionary and its entries can be accessed in the same way--this is what is happening when getting the series' title in the example below. JSON is not good for analysis, so we'll use **pandas** to put the data into shape."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import requests\n",
    "\n",
    "url = \"https://api.ons.gov.uk/timeseries/JP9Z/dataset/UNEM/data\"\n",
    "\n",
    "# Get the data from the ONS API:\n",
    "json_data = requests.get(url).json()\n",
    "\n",
    "# Prep the data for a quick plot\n",
    "title = json_data[\"description\"][\"title\"]\n",
    "df = (\n",
    "    pd.DataFrame(pd.json_normalize(json_data[\"months\"]))\n",
    "    .assign(\n",
    "        date=lambda x: pd.to_datetime(x[\"date\"]),\n",
    "        value=lambda x: pd.to_numeric(x[\"value\"]),\n",
    "    )\n",
    "    .set_index(\"date\")\n",
    ")\n",
    "\n",
    "df[\"value\"].plot(title=title, ylim=(0, df[\"value\"].max() * 1.2), lw=3.0);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've talked about *reading* APIs. You can also create your own to serve up data, models, whatever you like! This is an advanced topic and we won't cover it; but if you do need to, the simplest way is to use [Fast API](https://fastapi.tiangolo.com/). You can find some short video tutorials for Fast API [here](https://calmcode.io/fastapi/hello-world.html).\n",
    "\n",
    "### An easier way to interact with (some) APIs\n",
    "\n",
    "Although it didn't take much code to get the ONS data, it would be even better if it was just a single line, wouldn't it? Fortunately there are some packages out there that make this easy, but it does depend on the API (and APIs come and go over time).\n",
    "\n",
    "By far the most comprehensive library for accessing extra APIs is [**pandas-datareader**](https://pandas-datareader.readthedocs.io/en/latest/), which provides convenient access to:\n",
    "\n",
    "- FRED\n",
    "- Quandl\n",
    "- World Bank\n",
    "- OECD\n",
    "- Eurostat\n",
    "\n",
    "and more.\n",
    "\n",
    "Let's see an example using FRED (the Federal Reserve Bank of St. Louis' economic data library). This time, let's look at the UK unemployment rate:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas_datareader.data as web\n",
    "\n",
    "df_u = web.DataReader(\"LRHUTTTTGBM156S\", \"fred\")\n",
    "\n",
    "df_u.plot(title=\"UK unemployment (percent)\", legend=False, ylim=(2, 6), lw=3.0);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Webscraping\n",
    "\n",
    "Webscraping is a way of grabbing information from the internet that was intended to be displayed in a browser. But it should only be used as a last resort, and only then when permitted by the terms and conditions of a website.\n",
    "\n",
    "If you're getting data from the internet, it's much better to use an API whenever you can: grabbing information in a structure way is *exactly* why APIs exist. APIs should also be more stable than websites, which may change frequently. Typically, if an organisation is happy for you to grab their data, they will have made an API expressly for that purpose. I'd say it's pretty rare that there's a major website which *does* permit webscraping but which doesn't have an API; for these websites, if they don't have an API, chances scraping is against their terms and conditions. Those terms and conditions may be enforceable by law (different rules in different countries here, and you really need legal advice if it's not unambiguous as to whether you can scrape or not.)\n",
    "\n",
    "There are other reasons why webscraping is not so good; for example, if you need a back-run then it might be offered through an API but not shown on the webpage. (Or it might not be available at all, in which case it's best to get in touch with the organisation or check out WaybackMachine in case they took snapshots).\n",
    "\n",
    "So I'm pretty down on webscraping as there's almost always a better solution. However, there are occasionally times when it is useful.\n",
    "\n",
    "If you do find yourself in a scraping situation, be really sure to check that's legally allowed and also that you are not violating the website's `robots.txt` rules: this is a special file on almost every website that sets out what's fair play to crawl (conditional on legality) and what robots should not go poking around in.\n",
    "\n",
    "In Python, you are spoiled for choice when it comes to webscraping. There are five very strong libraries that cover a real range of user styles and needs: **requests**, **lxml**, **beautifulsoup**, **selenium**, and *scrapy**.\n",
    "\n",
    "For quick and simple webscraping, my usual combo would **requests**, which does little more than go and grab the HTML of a webpage, and **beautifulsoup**, which then helps you to navigate the structure of the page and pull out what you're actually interested in. For dynamic webpages that use javascript rather than just HTML, you'll need **selenium**. To scale up and hit thousands of webpages in an efficient way, you might try **scrapy**, which can work with the other tools and handle multiple sessions, and all other kinds of bells and whistles... it's actually a \"web scraping framework\".\n",
    "\n",
    "It's always helpful to see coding in practice, so that's what we'll do now, but note that we'll be skipping over a lot of important detail such as user agents, being 'polite' with your scraping requests, being efficient with caching and crawling.\n",
    "\n",
    "In lieu of a better example, let's scrape [http://aeturrell.com/](http://aeturrell.com/)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "url = \"http://aeturrell.com/\"\n",
    "page = requests.get(url)\n",
    "page.text[:300]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Okay, what just happened? We asked requests to grab the HTML of the webpage and then printed the first 300 characters of the text that it found.\n",
    "\n",
    "Let's now parse this into something humans can read (or can read more easily) using beautifulsoup:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "soup = BeautifulSoup(page.text, \"html.parser\")\n",
    "print(soup.prettify()[500:1000])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we see more structure of the page and even some *HTML tags* such as 'title' and 'link'. Now we come to the data extraction part: say we want to pull out every paragraph of text, we can use beautifulsoup to skim down the HTML structure and pull out only those parts with the paragraph tag ('p').\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Get all paragraphs\n",
    "all_paras = soup.find_all(\"p\")\n",
    "# Just show one of the paras\n",
    "all_paras[52]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make this more readable, you can use the `.text` method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "all_paras[52].text"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now let's say we didn't care about most of the page, we *only* wanted to get hold of the names of projects. For this we need to identify the tag type of the element we're interested in, in this case 'div', and it's class type, in this case \"project-name\". We do it like this (and show nice text in the process):\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "projects = soup.find_all(\"div\", class_=\"project-name\")\n",
    "projects = [x.text.strip() for x in projects]\n",
    "projects"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Hooray! We managed to get the information we wanted: all we needed to know was the right tags. A good tip for finding the tags of the info you want is to look at in your browser (eg Google Chrome) and then right-click on the bit you're interested in, then hit 'Inspect'. This will show you the HTML element of the bit of the page you clicked on.\n",
    "\n",
    "That's almost it for this very, very brief introduction to webscraping. We'll just see one more thing: how to iterate over multiple pages.\n",
    "\n",
    "Imagine we had a root webpage such as \"www.codingforeconomists.com\" which had subpages such as \"www.codingforeconomists.com/page=1\", \"www.codingforeconomists.com/page=2\", and so on. One need only iterate create the HTML strings to pass into a function that scrapes each one and return the relevant data, eg for the first 50 pages, and with a function called `scraper`, one might run\n",
    "\n",
    "```\n",
    "start, stop = 0, 50\n",
    "root_url = \"www.codingforeconomists.com/page=\"\n",
    "info_on_pages = [scraper(root_url + str(i)) for i in range(start, stop)]\n",
    "```\n",
    "\n",
    "That's all we'll cover here but remember we've barely *scraped* the surface of this big, complex topic. If you want to read about an application, it's hard not to recommend the paper on webscraping that has undoubtedly change the world the most, and very likely has affected your own life in numerous ways: [\"The PageRank Citation Ranking: Bringing Order to the Web\"](http://ilpubs.stanford.edu:8090/422/) by Page, Brin, Motwani and Winograd. For a more in-depth example of webscraping, check out realpython's [tutorial](https://realpython.com/python-web-scraping-practical-introduction/)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Webscraping Tables\n",
    "\n",
    "Often there are times when you don't actually want to scrape an entire webpage and all you want is the data from a *table* within the page. Fortunately, there is an easy way to scrape individual tables using the **pandas** package.\n",
    "\n",
    "We will read data from the first table on 'https://simple.wikipedia.org/wiki/FIFA_World_Cup' using **pandas**. The function we'll use is `read_html`, which returns a list of dataframes of all the tables it finds when you pass it a URL. If you want to filter the list of tables, use the `match=` keyword argument with text that only appears in the table(s) you're interested in.\n",
    "\n",
    "The example below shows how this works; looking at the website, we can see that the table we're interested in (of past world cup results), has a 'fourth place' column while other tables on the page do not. Therefore we run:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df_list = pd.read_html(\n",
    "    \"https://simple.wikipedia.org/wiki/FIFA_World_Cup\", match=\"Fourth Place\"\n",
    ")\n",
    "# Retrieve first and only entry from list of dataframes\n",
    "df = df_list[0]\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This gives us the table neatly loaded into a **pandas** dataframe ready for further use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Extracting data from PDFs\n",
    "\n",
    "PDFs are great. Unfortunately, some people love them so much that they think they're an appropriate way to store data rather than a convenient way to share text and/or figures. Or perhaps there's a table in a PDF that you'd legitimately like to get the info out from.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Extracting images and text from PDFs\n",
    "\n",
    "We'll use [**pdfminer.six**](https://pdfminersix.readthedocs.io/en/latest/) to get text out of the same PDF. For this simple operation, it's just a single command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "from pdfminer.high_level import extract_text\n",
    "\n",
    "# Download the pdf_with_table.pdf file from\n",
    "# https://github.com/aeturrell/coding-for-economists/blob/main/data/pdf_with_table.pdf\n",
    "# and put it in a subfolder called data before running the next line\n",
    "text = extract_text(Path(\"data/pdf_with_table.pdf\"))\n",
    "\n",
    "# Show only first 200 characters of text:\n",
    "print(text[:200])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This uses the 'high-level' part of the library, which is designed to help cover the most common use cases. But there's a lot more flexibility there if you need it: essentially, **pdfminer** will break up a pdf into a series of elements that you can sift through in the much the same way as **beautifulsoup** does for HTML. For example, you can use library to [extract images](https://pdfminersix.readthedocs.io/en/latest/howto/images.html) from PDFs too."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Tables\n",
    "\n",
    "The single best solution to grab tables is probably [**camelot**](https://camelot-py.readthedocs.io/en/master/index.html). Note that it does need you to have Ghostscript installed on your computer; you can find more information about the dependencies [here](https://camelot-py.readthedocs.io/en/master/user/install.html). It only works with text-based PDFs and not scanned documents: basically, if you can click and drag to select text in your table in a PDF viewer, then your PDF is text-based. In that case, **camelot** is able to sift through the contents and grab any tables and then pass them back as csvs or even **pandas** dataframes.\n",
    "\n",
    "At the time of writing, **camelot** had some versioning issues related to a dependency on an outdated version of **sqlalchemy**. You may need to install it in a separate virtual environment to use it.\n",
    "\n",
    "Here's a small example that assumes you have a pdf with a table in stored in a local directory:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "```python\n",
    "import camelot\n",
    "# Grab the pdf\n",
    "tables = camelot.read_pdf(os.path.join('data', 'pdf_with_table.pdf'))\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To extract any of the $n$ tables that are retrieved into a pandas dataframe individually, use `tables[0].df`. \n",
    "\n",
    "Note that **camelot** is not perfect--so it can produce a report on how it did when it tried to extract each table, which includes an accuracy score. This is found using, for example, `tables[0].parsing_report`.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Other useful PDF packages\n",
    "\n",
    "[**pdfcomments**](https://github.com/hoffmangroup/pdfcomments) is a library that allows you to strip out comments and sticky notes from PDF. This need not strictly be about data extraction, but [**PyPDF2**](https://pythonhosted.org/PyPDF2/) allows you to both split a PDF into separate pages and merge multiple PDFs together; see [here](http://www.blog.pythonlibrary.org/2018/04/11/splitting-and-merging-pdfs-with-python/) for the steps. (You can drag and drop pages using preview on MacOS but this library may be the easiest way to do the same thing on Windows.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Text extraction\n",
    "\n",
    "Not everything is a PDF file! If you want to get the text out of .doc, .docx, .epub, .gif, .json, .jpg, *.mp3*, .odt, .pptx, .rtf, .xlsx, .xls and, actually, .pdf too, then **textract** is for you. Mostly, it's a wrapper around a ton of other libraries. The upside is that getting the text out should be as easy as calling\n",
    "\n",
    "```python\n",
    "import textract\n",
    "text = textract.process(Path('path/to/file.extension'))\n",
    "```\n",
    "\n",
    "The downside is that it requires that some other (non-Python) libraries be installed and it doesn't (yet) work on Windows."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Review\n",
    "\n",
    "If you know how to get data from:\n",
    "\n",
    "- ✅ the internet using a URL;\n",
    "- ✅ the internet using an API;\n",
    "- ✅ the internet using webscraping; and\n",
    "- ✅ PDFs, Microsoft Word Documents, and more, using tools like **pdfminer.six** and **textract**\n",
    "\n",
    "then you have a good basic set of tools for getting the data that you need into a form you can use."
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "codeforecon",
   "language": "python",
   "name": "codeforecon"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
